     Color    Soothing?
  1  Green    Yes
  2  Pink     Yes
  3  Yellow   Yes

would result in the decision tree:

  yes
The discretize procedure you will write will be specific to the "Wisconsin Breast Cancer Database". This problem is not about writing a procedure that will work on any training data with continuous valued attributes. To write this procedure, you are given several training data sets drawn from this database. You can, for example, write a procedure that tests potential threshold values for an attribute, in order to turn it from a continuous valued attribute into a discrete valued attribute.
You can discretize one or all of the attributes in this data set. You should describe what you did to write your discretize procedure. If you've written code to help you analyze the training data to determine threshold values, include it. Half the points for this problem will be based on the written part, and half will be based on how well your discretize procedure works --- the web tester will use it to learn and then test a decision tree. The web tester for this problem is delayed because I need to determine what reasonable performance levels are for this problem.
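For instance, one way to test candidate thresholds is to score each one by how well it separates the two classes on a training set. The sketch below (split-accuracy is not part of the provided code; it assumes examples are (classification attribute-values) pairs with classifications 'm and 'b, as in bc-data-m1) counts, on each side of the threshold, how many examples agree with that side's majority class:

```scheme
;; Score a candidate THRESHOLD for the attribute at index I.
;; Returns the fraction of examples that the best single-threshold
;; split on attribute I would classify correctly -- higher is better.
(define (split-accuracy data i threshold)
  (let loop ((examples data)
             (above-m 0) (above-b 0)   ; class counts above the threshold
             (below-m 0) (below-b 0))  ; class counts at or below it
    (if (null? examples)
        (/ (+ (max above-m above-b) (max below-m below-b))
           (length data))
        (let* ((ex (car examples))
               (class (car ex))
               (value (list-ref (second ex) i)))
          (if (> value threshold)
              (if (eq? class 'm)
                  (loop (cdr examples) (+ above-m 1) above-b below-m below-b)
                  (loop (cdr examples) above-m (+ above-b 1) below-m below-b))
              (if (eq? class 'm)
                  (loop (cdr examples) above-m above-b (+ below-m 1) below-b)
                  (loop (cdr examples) above-m above-b below-m (+ below-b 1))))))))

;; Try several thresholds for attribute 3 (area) and compare their scores:
;; (map (lambda (t) (list t (split-accuracy bc-data-m1 3 t)))
;;      '(500 700 900 1100))
```

You could run something like this over a grid of candidate thresholds for each attribute, and keep the attributes and thresholds that score highest.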
See the assignment handout and the comments in this file for the data formats.
(define dt (learn-dtree mushroom-data1 mushroom-names))
;Value: dt

dt
;Value: (odor (c p)
;             (p p)
;             ...)

; take the 10th example from the second data set
(define t (list-ref mushroom-data2 9))
;Value: t

(car t)    ; the correct classification
;Value: p

(second t) ; the attribute values
;Value: (x s n t p f c n w e e s s w w p w o p n s g)

(classify (second t) dt mushroom-names 'attribute-not-found)
;Value: p  ; it was right!

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; test the learned decision tree on the second data set
;
(test-dtree dt mushroom-data2 mushroom-names)
The example: (f s w t f f c b w t b f s w w p w o p h v u)
was classified as: p  CORRECT
...
Out of 100 examples, 96 were correctly classified.
The training data sets for this problem are from the "Wisconsin Breast Cancer database". The examples in this database consist of measurements of a tissue sample; the classification is whether the sample is malignant or benign.
The measurements were done automatically from digitized images. You can see some of the original images at http://dollar.biz.uiowa.edu/~street/xcyt/images/.
There are 30 continuous valued attributes for each example. You must write a procedure:
(discretize attribute-values)

which will take a list of the 30 attribute values for an example from this database and return a list of attribute values that can be used by my learn-dtree procedure for Problem 2. Documentation for the data set is in the bc-data.txt file. The data set, along with some support code to help you test your procedure, is in the file bc-data.scm.
Here's an example discretize function (that doesn't purport to be any good since I just made up these threshold values):
(define (discretize attribute-values)
  (list (if (> (list-ref attribute-values 3) 1000) ; check fourth element
            'area>1000
            'area<=1000)
        (if (> (list-ref attribute-values 9) 0.06) ; check tenth element
            'fracdim>0.06
            'fracdim<=0.06)))

Note that this example takes the list of 30 attribute values and returns a list of two attribute values.
Here's how you could test your discretize procedure using the procedures I've provided in the bc-data.scm file:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; learn the decision tree from discretized data
;
(define dt (ldt-disc discretize bc-data-m1))
;Value: dt

dt
;Value: (1 (area>1000 m)
;          (area<=1000 (2 (fracdim<=0.06 b)
;                         (fracdim>0.06 b))))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; classify a single example with the learned decision tree
;
; attribute values only of the first example from another data set
(define ex (second (first bc-data-m2)))
;Value: ex

ex
;Value 17: (11.22 33.81 70.79 386.8 .0778 .03574 ...)

(discretize ex)
;Value 18: (area<=1000 fracdim<=0.06)

(classify-disc discretize ex dt 'm)
;Value: b

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; test the learned decision tree against another data set
;
(testdt-disc discretize dt bc-data-s)
The example: (area<=1000 fracdim<=0.06)
was classified as: b  CORRECT
The example: (area>1000 fracdim>0.06)
was classified as: m  CORRECT
The example: (area<=1000 fracdim>0.06)
was classified as: b  WRONG: the correct classification is m.
The example: (area<=1000 fracdim>0.06)
was classified as: b  WRONG: the correct classification is m.
...
Out of 20 examples, 11 were correctly classified.